1. Introduction

This project is based on the Data Science Game 2017 hosted on Kaggle, a worldwide competition between universities. This year the data was provided by the music streaming service Deezer. Deezer offers a recommendation feature called “Flow”, which suggests songs the user might like to listen to. The algorithm behind “Flow” uses collaborative filtering to provide the user with the right music at the right time. While the algorithm adapts to the user’s current listening taste by analysing which songs are currently played or skipped, the prediction for the very first song is the most difficult one. Therefore, the goal of this challenge was to predict how likely it is that the users in the test dataset will listen to the first track recommended by Flow.

At the end of the project we achieved place 66 out of 145 teams, with 45 submissions and a score of 0.63860. This report reflects our whole project journey, including our ideas, failures and achievements.

The report first gives a short introduction to the dataset. Subsequently, our approaches to feature engineering are explained and the prediction models we used are presented.

The following libraries were used for the project:
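The original list of attached packages is missing here; based on the functions used throughout this report, the setup presumably looked like this:

```r
library(dplyr)        # data wrangling: group_by, summarise, filter, %>%
library(ggplot2)      # plots: ggplot, geom_histogram, geom_bar, ...
library(RColorBrewer) # color palettes via brewer.pal
library(jsonlite)     # stream_in for reading extra_infos.json
library(data.table)   # compact tabular printing via data.table()
```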

2. Data Analysis & Challenges

This chapter presents an exploratory analysis of the competition dataset. As mentioned before, the goal of this challenge was to predict whether the first recommended song will be listened to or not. “Not listened” means that the song was skipped within the first 30 seconds, from which it is assumed that the user does not like the song.

The test dataset contains the first recommended track for several different users. The train dataset was generated from the users’ listening histories over one month. Each row represents one listened or skipped song.

Data:

Deezer <- read.csv("/home/Deezer/10_Basic_Dataset/train.csv")
Deezer_test <- read.csv("/home/Deezer/10_Basic_Dataset/test.csv")

tmp <- Deezer_test[,-1] 
tmp$is_listened <- 0
all <- rbind(Deezer,tmp)
rm(tmp)

The train data contains 7.56 million rows and 15 features, while the test set just contains 19.92 thousand rows.
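A minimal sanity check of the stated sizes (assuming `Deezer` and `Deezer_test` are loaded as above; the exact row counts are not printed in the original):

```r
# Rows and columns of the train and test sets
dim(Deezer)       # approx. 7.56 million rows, 15 columns
dim(Deezer_test)  # approx. 19.92 thousand rows
```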

The dataset provides the following columns, which we grouped into user-specific, song-specific and device-specific features. These features are analyzed and checked for NAs. As all features are encoded as numeric, we convert them into the appropriate data types later on.

User specific:

* user_id - anonymized, unique id of the user
* user_gender - gender of the user
* user_age - age of the user

Song specific:

* media_id - identifier of the song listened to by the user
* album_id - identifier of the album of the song
* artist_id - identifier of the artist of the song
* genre_id - identifier of the genre of the song
* media_duration - duration of the song
* context_type - type of content in which the song was listened to: playlist, album, …
* release_date - release date of the song in the format YYYYMMDD

Device specific:

* platform_name - type of operating system
* platform_family - type of device

Other features:

* ts_listen - timestamp of the listening event in UNIX time (seconds since 1970-01-01 00:00, sometimes with milliseconds)
* listen_type - whether the song was listened to within a Flow session or not

Response variable:

* is_listened - 1 if the track was listened to for longer than 30 seconds, 0 otherwise
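Before modelling, the class balance of the response variable is worth checking, since it puts later model scores into perspective; a minimal sketch (the actual proportions are not reported in the original):

```r
# Share of listened (1) vs. skipped (0) tracks in the train set
prop.table(table(Deezer$is_listened))
```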

2(a) User specific features

The data contains 19918 different users. Some of them occur thousands of times, while nearly 1500 users occur just once (with a single listened or skipped song). No NAs are found.

user_distribution <- Deezer %>%
  group_by(user_id) %>%
  summarise(n=n()) %>%
  arrange (-n)
myColors <- rep(brewer.pal(7,"Blues")[5:6],50)

ggplot(user_distribution, aes(x=n))+
  geom_histogram(bins=100, fill=myColors,color="black")+
  scale_x_continuous(breaks=seq(0,2000,50),limits=c(0,2000))+
labs(subtitle="in the train set",
      y="number of users",
      x="number of songs",
      title="Distribution of Observations per User",
      caption="Source: Deezer train data")+
theme_light()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#ggsave("observations_distribution.png",plot=gg6, width = 8, height=5, units="in")

The data reflects the listening behavior of male and female users. However, we do not know which gender the values 0 and 1 correspond to. The plot below illustrates the number of users per gender.

myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=user_gender))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
      y="number of users",
      x="Gender",
      title="Distribution of Users Gender",
      caption="Source: Deezer train data")+
theme_light()

The users’ ages range from 18 to 30 with a median of 25. The following boxplot illustrates the distribution of the unique users by their age. Accordingly, 50% of the users are 25 years old or younger. The feature age does not contain any NAs in the Deezer dataset.

sum(is.na(Deezer$user_age))
## [1] 0
age <- Deezer %>%
  group_by(user_id) %>% summarise(age=max(user_age))
boxplot(age$age)

myColors <- c(rep(brewer.pal(7,"Blues")[5:6],6),"#4292C6")
ggplot(age, aes(x=age))+stat_bin(binwidth=1, fill=myColors, color="black") +  scale_y_continuous(limits=c(0,2200),breaks=seq(0,2200,100))+
    stat_bin(binwidth=1, geom="text", aes(label=..count..), vjust=-1.5)+ scale_x_continuous(breaks=seq(18,30,1))+
labs(y="number of users",
      x="age",
      title="Distribution of Age among Users",
      caption="Source: Deezer train data")+
theme_light()

#ggsave("Age_distribution.png",plot=gg7, width = 8, height=5, units="in")

2(b) Song specific features

The dataset contains 452975 different media_ids. One media_id represents one specific song from one artist on one specific album. At first, we thought that one media_id reflects one unique song. However, when checking the provided extra information (extra_infos.json), which links each media_id to a song title, album title and an artist name, it becomes apparent that one unique song (e.g. “Everybody” by the Backstreet Boys) can have more than one media_id. This is caused by the album: whenever a song also appears on another album (e.g. Fetenkult - Best of the 90’s), a new media_id is generated.

One example is covered by the following data table:

extra = stream_in(file("/home/Deezer/10_Basic_Dataset/extra_infos.json"))
## opening file input connection.
## 
 Found 500 records...
 ...
 Found 453310 records...
 Imported 453310 records. Simplifying...
## closing file input connection.
extra_example <- extra %>%
  filter(sng_title=="Everybody (Backstreet's Back)" & art_name=="Backstreet Boys")
data.table(extra_example)
##    media_id                     sng_title
## 1: 70222244 Everybody (Backstreet's Back)
## 2:  4254236 Everybody (Backstreet's Back)
## 3: 15391051 Everybody (Backstreet's Back)
## 4:   834740 Everybody (Backstreet's Back)
## 5:   632276 Everybody (Backstreet's Back)
## 6: 14164455 Everybody (Backstreet's Back)
## 7:   605957 Everybody (Backstreet's Back)
##                                                         alb_title
## 1:                                  The Essential Backstreet Boys
## 2: Essential Pop Anthems: Classic 80s, 90s and Current Chart Hits
## 3:                                                Backstreet Boys
## 4:                                               新好男孩超級精選
## 5:                                              Backstreet's Back
## 6:                                               The Very Best Of
## 7:                                   Fetenkult - Best Of The 90's
##           art_name
## 1: Backstreet Boys
## 2: Backstreet Boys
## 3: Backstreet Boys
## 4: Backstreet Boys
## 5: Backstreet Boys
## 6: Backstreet Boys
## 7: Backstreet Boys

Furthermore, we can count 151471 different albums and 67142 different artists in the dataset. No NAs are identified.

sum(is.na(Deezer$media_id))
## [1] 0
sum(is.na(Deezer$album_id))
## [1] 0
sum(is.na(Deezer$artist_id))
## [1] 0

According to the extra information from the json file, we know that media_id and album_id are strongly correlated with each other; the correlation coefficient is 0.998958. The following histograms underline this correlation. Furthermore, some media_ids (and therefore some albums and artists) are recommended far more often than others across all users.
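The stated coefficient can presumably be reproduced directly on the raw id columns (a correlation between arbitrary ids is only meaningful here because Deezer appears to assign media and album ids in a related order):

```r
# Pearson correlation between the raw id columns
cor(Deezer$media_id, Deezer$album_id)
```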

par(mfrow=c(1,3))
hist(Deezer$media_id)
hist(Deezer$album_id)
hist(Deezer$artist_id)

Likewise, 2922 different genres exist in the dataset. The counts show clearly that one specific genre, with id = 0, occurs almost 3.7 million times, while the second most frequent genre occurs fewer than 1 million times. This would mean that one specific genre is recommended by far the most. Further investigations are done in the feature engineering chapter.

sum(is.na(Deezer$genre_id))
## [1] 0
number_genre <- Deezer %>%
  group_by(genre_id) %>%
  summarize(n=n()) %>%
  arrange(-n)
head(number_genre)
## # A tibble: 6 × 2
##   genre_id       n
##      <int>   <int>
## 1        0 3666789
## 2        7  929538
## 3       10  288272
## 4       25  269032
## 5       27  187946
## 6       14  177117
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],25))

ggplot(number_genre, aes(genre_id)) + 
  geom_histogram(bins=50,fill=myColors, color="black")+ 
  scale_y_continuous(limits=c(0,750),breaks=seq(0,750,100))+
labs(y="number of observations",
      x="genre_id",
      title="Distribution of Genre_id",
      caption="Source: Deezer train data")+
theme_light()

We were also provided with the media duration, i.e. the length of the song. A quick look at the summary reveals a couple of outliers.

summary(Deezer$media_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   196.0   222.0   231.2   254.0 65535.0

For the visualisation, we filtered out values below the 5% and above the 95% quantile, keeping all values between 158 and 333 seconds.

myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))

ggplot(Deezer%>%filter(media_duration>=quantile(media_duration,0.05) & media_duration<=quantile(media_duration,0.95)), aes(media_duration)) + 
  geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="number of observations",
      x="media duration in seconds",
      title="Distribution of Mediaduration",
     subtitle="for 90% of the data",
      caption="Source: Deezer train data")+
theme_light()

context_type - type of context in which the song was listened: playlist, album, …

sum(is.na(Deezer$context_type))
## [1] 0
hist(Deezer$context_type)

data.table(table(Deezer$context_type))
##     V1       N
##  1:  0 3198365
##  2:  1 1617653
##  3:  2 1052844
##  4:  3  433456
##  5:  4  230259
##  6:  5  167428
##  7:  6   98523
##  8:  7   97645
##  9:  8   79330
## 10:  9   71572
## 11: 10   71026
## 12: 11   68775
## 13: 12   58460
## 14: 13   55830
## 15: 14   21641
## 16: 15   20826
## 17: 16   20725
## 18: 17   20075
## 19: 18   18933
## 20: 19   17973
## 21: 20   17489
## 22: 21   16918
## 23: 22   16625
## 24: 23   13168
## 25: 24   11200
## 26: 25   10206
## 27: 26    6158
## 28: 27    5017
## 29: 28    4256
## 30: 29    3739
## 31: 30    3365
## 32: 31    3038
## 33: 32    2985
## 34: 33    2704
## 35: 34    2263
## 36: 35    2128
## 37: 36    1977
## 38: 37    1822
## 39: 38    1746
## 40: 39    1688
## 41: 40    1211
## 42: 41     984
## 43: 42     912
## 44: 43     868
## 45: 44     658
## 46: 45     593
## 47: 46     510
## 48: 47     474
## 49: 48     375
## 50: 49     278
## 51: 50     278
## 52: 51     265
## 53: 52     206
## 54: 53     162
## 55: 54     153
## 56: 55     147
## 57: 56     127
## 58: 57     118
## 59: 58     112
## 60: 59     100
## 61: 60      98
## 62: 61      92
## 63: 62      91
## 64: 63      72
## 65: 64      35
## 66: 65      33
## 67: 66      27
## 68: 67       7
## 69: 68       7
## 70: 69       3
## 71: 70       2
## 72: 71       2
## 73: 72       2
## 74: 73       1
##     V1       N

release_date - release date of the song in the format YYYYMMDD

sum(is.na(Deezer$release_date))
## [1] 0

2(c) Device specific features

The dataset also contains information about the device and the operating system the user used for listening. Both features take 3 values (1:3). While device 1 and operating system 1 are used the most, we cannot tell which specific device or system they represent. Furthermore, the two columns are correlated (r = 0.5).

par(mfrow=c(1,2))
plot(as.factor(Deezer$platform_family))
plot(as.factor(Deezer$platform_name))

sum(is.na(Deezer$platform_family))
## [1] 0
sum(is.na(Deezer$platform_name))
## [1] 0

2(d) Other features, incl. response variable

The ts_listen feature contains the time at which a specific song was recommended to a user. It is a UNIX timestamp and will be transformed later on. No NAs are detected.

sum(is.na(Deezer$ts_listen))
## [1] 0

As the training data covers the overall listening behavior of a user, the song history includes both songs listened to in Flow and songs that were not recommended (found by search or in saved song lists). 5.24 million rows do not represent recommended songs, while 2.32 million were suggested by the recommendation engine Flow. The test data contains only songs suggested by Flow.

myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=listen_type))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
      y="number of observations",
      x="listening type",
      title="Distribution of Listening Type",
      caption="Source: Deezer train data")+
theme_light()

table(Deezer_test$listen_type)
## 
##     0     1 
##     1 19917
sum(is.na(Deezer$listen_type))
## [1] 0

Interestingly, we detected a single song inside the test set with listen_type == 0.

Finally, the most important column, “is_listened”, is the response variable in the modeling part.

myColors <- rep(brewer.pal(7,"Blues")[5:6],2)
ggplot(Deezer,aes(x=is_listened))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
      y="number of observations",
      x="is listened",
      title="Distribution of Is Listened versus Listening Type",
      caption="Source: Deezer train data")+
theme_light()+facet_wrap(~listen_type)

Here we can see that songs which weren’t suggested (e.g. searched for or listened to in a playlist) have a higher chance of being listened to (0.721113) than songs which were suggested by Flow (0.6002817).
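The two rates can be reproduced by averaging is_listened per listening type; a minimal sketch on a synthetic stand-in frame (on the real data, the same call runs on Deezer):

```r
# Hypothetical mini train set: listen_type 0 = not suggested, 1 = suggested by Flow
toy <- data.frame(
  listen_type = c(0, 0, 0, 1, 1, 1, 1),
  is_listened = c(1, 1, 0, 1, 0, 1, 0)
)

# Share of listened songs per listening type
aggregate(is_listened ~ listen_type, data = toy, FUN = mean)
```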

2(e) Summary of main challenges

At first glance, the dataset looks like tall data. However, most of the features are factors. As already mentioned, the data contains 19918 different users, 2922 genres, 67142 artists, 151471 albums and 452975 unique songs.

Therefore, the main challenge of this project was building a prediction model with this number of factor levels. Furthermore, the insufficient genre_id column and the overly fine-grained timestamp and release date were features that needed improvement.

Furthermore, the following correlation plot visualizes the collinearity between the variables.

corDeezer <- cor(Deezer)
#png("correlationMatrix.png")
corDeezerplot <- corrplot(corDeezer, method = "ellipse", type = "full")

#dev.off()

3. Feature Engineering

In this part, we describe our feature engineering. We start with simple transformations and calculations and move on to more advanced approaches throughout this chapter. We try to demonstrate everything we achieved and how we arrived, chronologically, at our most successful point.

3(a) Timestamp conversion

Converting the timestamp is the most basic feature engineering we did. We extract the hour of day as well as the weekday from the timestamp. This procedure is simple enough that we did it on the fly, as you will see at the end of the feature engineering section. Nevertheless, here is what happens.

The idea behind this is the assumption that the hour of day and/or the day of week contribute patterns to the prediction, e.g. a user who regularly goes to the gym on Tuesday and Thursday at 8pm, or users who tend to listen to slower songs in the evening and faster songs on the way to work. These patterns cannot be found at second-level time precision.

transforming_timestamp <- function(x){
  timestamp <- as.POSIXct(x$ts_listen, origin="1970-01-01")
  splitdt <- data.frame(
    hh = as.numeric(format(timestamp, format = "%H")), #hour of day, 24h format
    wd = as.numeric(format(timestamp, format = "%w"))) #weekday, 0 = Sunday
  x <- cbind(x, splitdt) #bind the extracted time columns onto the data
  return(x)
}
transforming_timestamp(Deezer)[1:10,12:17]
##    user_id artist_id user_age is_listened hh wd
## 1     9241     55164       29           0 14  4
## 2    16547     55830       30           1 23  3
## 3     7665      2704       29           1 14  6
## 4     1580       938       30           0 10  6
## 5     1812      2939       24           1 19  6
## 6     1812      2939       24           1 22  6
## 7     1812      2939       24           1 10  6
## 8    10325      2939       29           1 14  6
## 9     1812      2939       24           1 14  1
## 10      51      2939       28           1 13  2
tmp <- transforming_timestamp(Deezer)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)

ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,600000),breaks=seq(0,600000,50000))+
  scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
      x="hour of day",
      title="Distribution of Listening Hour",
      caption="Source: Deezer train data")+
theme_light()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#ggsave("hh_distribution_train.png",plot=gg9, width = 8, height=5, units="in")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")

ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,1300000),breaks=seq(0,1300000,100000))+
  scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
      x="weekday",
      title="Distribution of Listening Weekday",
      caption="Source: Deezer train data")+
theme_light()

#ggsave("wd_distribution_train.png",plot=gg10, width = 8, height=5, units="in")
tmp <- transforming_timestamp(Deezer_test)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)

  ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,1800),breaks=seq(0,1800,100))+
  scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
      x="hour of day",
      title="Distribution of Listening Hour",
     subtitle="for test data",
      caption="Source: Deezer test data")+
theme_light()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#ggsave("hh_distribution_test.png",plot=gg11, width = 8, height=5, units="in")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")

ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,5000),breaks=seq(0,5000,250))+
  scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
      x="weekday",
      title="Distribution of Listening Weekday",
     subtitle="for test data",
      caption="Source: Deezer test data")+
theme_light()

#ggsave("wd_distribution_test.png",plot=gg12, width = 8, height=5, units="in")

3 (b) Beats per Minute

Beats per minute (bpm) is a simple numeric feature which we could query from the API. We thought it might help differentiate within genres, for example “soft rock” tending to be slower (lower bpm) than its associated genre “rock”. It wasn’t our intention to create subgenres in any way, just to feed this extra information into the model.

For querying the API we used the httr package, and jsonlite to handle the response. To show how the code works, we limit the number of queries to 50 instead of nearly 450,000 media_ids.

## Load necessary packages
library(jsonlite)
library(httr)
uniquetracks <- as.data.frame(unique(all$media_id)[1:50]) #getting all unique media_ids
uniquetracks$api <- paste("https://api.deezer.com/track/",uniquetracks[,1], sep="")
uniquetracks$bpm <- 0

for (i in 1:length(uniquetracks[,1])) { ## for all tracks
    this.raw.result <- GET(url = as.character(uniquetracks[i,2])) ## get the infos
    this.result <- fromJSON(rawToChar(this.raw.result$content)) ## turn it into a readable format
    uniquetracks$bpm[i] <- ifelse(is.null(this.result$bpm),"NA",this.result$bpm) ## get the BPM
    #message(as.character(i), appendLF = FALSE) ## print the iteration to see if the code is still working
    Sys.sleep(time = 0.05) ## cap the speed so the 50 per 5 seconds are not violated
}
head(uniquetracks,10)
##    unique(all$media_id)[1:50]                                 api   bpm
## 1                      222606 https://api.deezer.com/track/222606 100.1
## 2                      250467 https://api.deezer.com/track/250467 160.2
## 3                      305197 https://api.deezer.com/track/305197  90.1
## 4                      900502 https://api.deezer.com/track/900502 124.0
## 5                      542335 https://api.deezer.com/track/542335 120.0
## 6                      542341 https://api.deezer.com/track/542341 123.0
## 7                      542346 https://api.deezer.com/track/542346 163.0
## 8                      542340 https://api.deezer.com/track/542340  95.0
## 9                      542347 https://api.deezer.com/track/542347 117.0
## 10                     549742 https://api.deezer.com/track/549742 120.0
#save(tracks, file = "tracks_BPM.rda")

After getting the information from the API, we need to clean it. Because the bpm is received as character, we can set all cells where the string is “NA” to 0 without using is.na(). Converting to numeric afterwards leaves no missing values, but about 30,000 zeros (only 171 of which were introduced by setting “NA” to 0).

load("/home/Deezer/30_Wrangled_Data/Archiv/tracks_BPM.rda")
tracks <- tracks[,c(2,3)]
tracks$track_bpm[tracks$track_bpm=="NA"] <- "0"
tracks$track_bpm <- as.numeric(tracks$track_bpm)
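The zero counts mentioned above can be checked along these lines; the character vector here is a synthetic stand-in for tracks$track_bpm (on the real data this reported roughly 30,000 zeros, 171 of them from the “NA” strings):

```r
# Hypothetical stand-in for the raw bpm column, received as character from the API
bpm_chr <- c("120", "NA", "0", "95", "NA")

sum(bpm_chr == "NA")            # zeros that will be introduced by the "NA" strings
bpm_chr[bpm_chr == "NA"] <- "0"
bpm_num <- as.numeric(bpm_chr)

sum(is.na(bpm_num))             # no missing values remain after the conversion
sum(bpm_num == 0)               # total zeros: original zeros plus former "NA"s
```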

myColors <- rep(brewer.pal(7,"Blues")[5:6],20)

ggplot(tracks, aes(x=track_bpm))+stat_bin(bins=40, fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,46000),breaks=seq(0,45000,5000))+
  scale_x_continuous(breaks=seq(0,240,10))+
labs(title="Distribution of Beats per Minute",
     y="observations",
      x="beats per minute",
      caption="Source: Deezer train data")+
theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

To fill the zeros, we will take the mean bpm of the album the track is on, or of its genre if the album information is not available. Sadly, at this point we did not yet have information about the genre, so to keep the narrative chronological and clear, we move to the genre section now and come back to this afterwards.

3 (c) Discarded: EchoNest, singer’s gender, users age, user’s language vs. song language

To increase the available information we looked for additional sources, such as API providers, and found EchoNest together with an R interface (https://github.com/mukul13/rechonest). EchoNest provides a lot of information about songs, albums, artists, etc. Unfortunately, EchoNest was bought by Deezer’s market competitor Spotify. Furthermore, a hint in the DSG forum clarified that we may use the Deezer API but are not allowed to use other sources.

In a brainstorming session we thought about additional factors that influence whether we want to listen to a song or not. One aspect is the timbre of a voice or the gender of an artist, e.g. in combination with a genre. Since we were not allowed to use EchoNest and this information is not available in Deezer’s API, we discarded the idea.

Thinking of one’s parents’ music taste, one could claim that older people tend to listen to older songs; one explanation could be that users tend to listen to songs from their “youth”. The question was whether user_age and the year of release_date correlate. Unfortunately, user_age and release_date were not significant in our first models.
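To check such a relation, the release year can be stripped out of the YYYYMMDD integer with integer division; a sketch on hypothetical values (on the real data this would run on Deezer$release_date and Deezer$user_age):

```r
# Hypothetical user ages and YYYYMMDD release dates
toy <- data.frame(
  user_age     = c(18, 22, 25, 30, 30),
  release_date = c(20150101, 20100315, 20040720, 19991231, 20011111)
)

toy$release_year <- toy$release_date %/% 10000  # integer division drops MMDD
cor(toy$user_age, toy$release_year)             # negative here: older users, older songs
```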

The comparison of the user’s language with the song’s language was another idea we considered, but it was ultimately discarded as well.

3 (d) Genre_id

In a first approach we tried to get enhanced information about the genre_id by querying the API for the genre_ids. As seen in the visualisation part, we were confronted with a vast number of rows with genre_id == 0, and it would be good to know which genre that is.

uniquegenres <- unique(all$genre_id) #getting all unique genre_ids
uniquegenres <- as.data.frame(paste("https://api.deezer.com/genre/",uniquegenres, sep="")) #paste the ids into the needed API-url to access the informations
uniquegenres$name <- "" #initialise an empty column
uniquegenres$genre_id <- unique(all$genre_id)

At this point we are able to query from the API, using the following loop. To show the results, we set the sample size to 50.

library(httr)
library(jsonlite)

for (i in 1:50){ #first 50 genres; replace 50 by nrow(uniquegenres) to loop over all
    this.raw.result <- GET(url = as.character(uniquegenres[i,1])) # get data
    this.result <- fromJSON(rawToChar(this.raw.result$content)) # convert the raw response to character and parse the JSON
    uniquegenres$name[i] <- ifelse(is.null(this.result$name),"NA",this.result$name) # fill NA where NULL
    Sys.sleep(time = 0.1) # System Sleep time to not overload the APIs capacity of 50 requests every 5 seconds
}
head(uniquegenres[,c(2,3)],10)
##                       name genre_id
## 1                       NA    25471
## 2                       NA    25571
## 3        Musique asiatique       16
## 4         Seggae mauritien        7
## 5                 Afro Pop        3
## 6                       NA     2519
## 7     Musique indonésienne       25
## 8                       NA       50
## 9                  Zouglou       10
## 10 Spiritualité & religion      228

As we can see, the genre names are mainly NAs, and if one looked at row number 21, one would see that genre_id == 0 simply refers to “All”.

Now that we know we cannot and should not rely on the genre_id, we need to get the genre information from another source. Luckily, we have the album_id, and the API provides genre information for each album individually, independent of the genre_id.

Genre Information from album_id
albums <- unique(all$album_id) #getting all album ids
albums <- paste("https://api.deezer.com/album/",albums, sep="") #get them in the right format for query the API
albums <- as.data.frame(albums)
albums$album_id <- unique(all$album_id)

#initiating some columns
albums$alb.genre <- ""
albums$alb.genre.id <- ""

In the next chunk we receive the genre information for the first 50 albums.

for (i in 1:50){ #for 50 albums. you can replace 50 by length(albums$albums) to loop over all
    this.raw.result <- GET(url = as.character(albums[i,1])) #get the infos
    this.result <- fromJSON(rawToChar(this.raw.result$content)) #turn it into a readable format
    albums$alb.genre[i] <- ifelse(is.null(this.result$genres$data[2]),"NA",this.result$genres$data[2]) #getting the written genre
    albums$alb.genre.id[i] <- ifelse(is.null(this.result$genre_id),"NA",this.result$genre_id) #getting the genre_id from the album
    #message(as.character(i), appendLF = FALSE) #print the iteration to see if the code is still working
    Sys.sleep(time = 0.05) #cap the speed so the 50 per 5 seconds are not violated
}

formattable(head(albums,50))
albums album_id alb.genre alb.genre.id
https://api.deezer.com/album/41774 41774 Folk 466
https://api.deezer.com/album/43941 43941 Electro, Dance , Pop , Rock 106
https://api.deezer.com/album/48078 48078 Klassik , Filme/Videospiele, Filmmusik 98
https://api.deezer.com/album/71521 71521 Pop 132
https://api.deezer.com/album/71718 71718 Country , Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock 84
https://api.deezer.com/album/72309 72309 NULL -1
https://api.deezer.com/album/72703 72703 NULL -1
https://api.deezer.com/album/74082 74082 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/74870 74870 NA NA
https://api.deezer.com/album/76052 76052 Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/77825 77825 Pop , Filme/Videospiele, Filmmusik , Folk 132
https://api.deezer.com/album/79401 79401 Dance, Pop , Rock 113
https://api.deezer.com/album/80386 80386 Pop , R&B , Spirituelles & Religion 132
https://api.deezer.com/album/80780 80780 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/82356 82356 Rap/Hip Hop , R&B , Contemporary Soul 116
https://api.deezer.com/album/82750 82750 Brasilianische Musik, Pop , International Pop 75
https://api.deezer.com/album/82947 82947 Pop , Rock , Indie Rock/Rock Pop 132
https://api.deezer.com/album/83144 83144 Pop , International Pop , Lateinamerikanische Musik 132
https://api.deezer.com/album/87478 87478 Rap/Hip Hop, Pop , Rock , R&B , Soul & Funk 116
https://api.deezer.com/album/87675 87675 Alternative 85
https://api.deezer.com/album/88463 88463 Alternative, Indie Rock , Electro , Dance , Pop , Rock 85
https://api.deezer.com/album/89448 89448 Klassik 98
https://api.deezer.com/album/90630 90630 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/94964 94964 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk , International Pop, Rock 85
https://api.deezer.com/album/97328 97328 Pop , Reggae, Rock 132
https://api.deezer.com/album/97722 97722 Pop , Rock 132
https://api.deezer.com/album/97919 97919 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/99298 99298 Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock 85
https://api.deezer.com/album/99495 99495 Dance 113
https://api.deezer.com/album/99692 99692 Pop 132
https://api.deezer.com/album/100086 100086 Heavy Metal 464
https://api.deezer.com/album/102450 102450 Pop 132
https://api.deezer.com/album/104223 104223 Rap/Hip Hop 116
https://api.deezer.com/album/104617 104617 Pop , Rock 132
https://api.deezer.com/album/106390 106390 Electro , Techno/House , Dance , Jazz , Pop , International Pop 106
https://api.deezer.com/album/109345 109345 Dance 113
https://api.deezer.com/album/109542 109542 Electro, Dance , Pop , Reggae , Disco 106
https://api.deezer.com/album/113876 113876 NA NA
https://api.deezer.com/album/116831 116831 Pop , International Pop, Rock 132
https://api.deezer.com/album/118013 118013 Country , Pop , Rock , Singer & Songwriter 84
https://api.deezer.com/album/118801 118801 Rock 152
https://api.deezer.com/album/119589 119589 Jazz 129
https://api.deezer.com/album/119983 119983 R&B 165
https://api.deezer.com/album/120574 120574 Pop , Rock 132
https://api.deezer.com/album/122544 122544 Jazz 129
https://api.deezer.com/album/123135 123135 Pop 132
https://api.deezer.com/album/124120 124120 Pop 132
https://api.deezer.com/album/124711 124711 Pop 132
https://api.deezer.com/album/124908 124908 Rap/Hip Hop, R&B 116
https://api.deezer.com/album/125105 125105 Jazz 129

So far we have created a new DF called “albums” with 4 columns: API query, album_id, genre and genre_id, where genre is the written name of the genre and the id is the corresponding id. We can see that some albums have more than one genre, and some have NULL.

Looking back at the data, it would have been quite easy to extract the genre lists from the cells of the DF. But when we started the project, we were not that fond of performing transformations on lists of lists. And since this report demonstrates what we did, not what we should have done, here is our original approach.

We used gsub to get rid of all symbols.

albums[,'alb.genre2'] <- gsub("c\\(", "" , albums[,'alb.genre']) #remove "c("
albums[,'alb.genre2'] <- gsub("\\)", "" , albums[,'alb.genre2']) #remove ")"
albums[,'alb.genre2'] <- gsub("\"", "" , albums[,'alb.genre2']) #remove """
albums[,'alb.genre2'] <- gsub(" ", "" , albums[,'alb.genre2']) #remove " "

albums <- splitstackshape::cSplit(albums,splitCols = "alb.genre2",direction = "wide")
formattable(head(albums[,c(2,5,6,7)],10))
album_id alb.genre2_1 alb.genre2_2 alb.genre2_3
41774 Folk NA NA
43941 Electro Dance Pop
48078 Klassik Filme/Videospiele Filmmusik
71521 Pop NA NA
71718 Country Alternative IndieRock
72309 NULL NA NA
72703 NULL NA NA
74082 Alternative IndiePop IndieRock
74870 NA NA NA
76052 Alternative IndieRock Pop

At this stage we decided to grab only the very first of the listed genres for each album, since it makes most sense that the main genre is named first. We still have NULLs and NAs.

Our first approach to tackling the NULLs and NAs was to get each album’s corresponding artist id. Then, with the artist information, we could look up the artist’s most played genre to fill in missing data. For example, if artist 1 has 4 albums in the genre “rock”, one could assume that a 5th album with a missing genre is “rock” as well.

albums$artist <- all$artist_id[match(albums$album_id, all$album_id)] #join the artist information; match() aligns each album with its first matching row

#save(albums,file="bums_art.rda")
#load("bums_art.rda")

At this point we load data that was saved during an earlier run.

load("/home/Deezer/30_Wrangled_Data/bums_art.rda")
formattable(head(albums,10))
album_id alb.genre alb.genre.id alb.genre2_01 alb.genre2_02 alb.genre2_03 alb.genre2_04 alb.genre2_05 alb.genre2_06 alb.genre2_07 alb.genre2_08 alb.genre2_09 alb.genre2_10 alb.genre2_11 alb.genre2_12 alb.genre2_13 alb.genre2_14 alb.genre2_15 alb.genre2_16 artist
41774 Folk 466 Folk NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 55164
43941 Electro, Dance , Pop , Rock 106 Electro Dance Pop Rock NA NA NA NA NA NA NA NA NA NA NA NA 55830
48078 Klassik , Filme/Videospiele, Filmmusik 98 Klassik Filme/Videospiele Filmmusik NA NA NA NA NA NA NA NA NA NA NA NA NA 2704
71521 Pop 132 Pop NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 938
71718 Country , Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock 84 Country Alternative IndieRock Pop IndiePop/Folk Rock NA NA NA NA NA NA NA NA NA NA 2939
72309 NULL -1 NULL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2939
72703 NULL -1 NULL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2939
74082 Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock 85 Alternative IndiePop IndieRock Pop IndiePop/Folk Rock NA NA NA NA NA NA NA NA NA NA 2939
74870 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2939
76052 Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock 85 Alternative IndieRock Pop IndiePop/Folk Rock NA NA NA NA NA NA NA NA NA NA NA 2939

Only with this data can you see what splitstackshape::cSplit really did: it cut the genres out of the list of lists and put each into its own column. Therefore, using just the first of these new columns is what we want, since it is the first listed genre. To continue, we created a new DF containing all the missing data (NAs and NULLs).

Na.albums <- albums %>% filter(alb.genre.id=="NA" | alb.genre.id==-1) #get all NAs and NULLs into seperate DF
Na.albums <- as.data.frame(Na.albums[,c(1,20)])
colnames(Na.albums) <- c("album_id","artist_id")
albums2 <- albums[!albums$album_id %in% Na.albums$album_id,] #get clean data into a new DF

As we can see, there are 20521 rows in Na.albums, which means that 20521 albums out of 151605 do not have a genre after querying the API.

all2 <- all[,c(4,13)] #subset all to contain only album_id and artist_id
all2 <- left_join(all2,albums2[,c(1,4)],by="album_id") #joining our new genre

#creating artists loopuptable
loup.artists <- all2 %>%
  group_by(artist_id,alb.genre2_01) %>%
  summarise(n=n()) %>% #how often each genre appears per artist; listened/played songs weigh more than the genres of released albums
  filter(n==max(n))

Na.albums <- merge(Na.albums,loup.artists[,-3],by="artist_id") #reduces the roughly 20,500 NA albums to 999

albums2 <- albums2[,c(20,1,4)]
albums2 <- bind_rows(albums2, Na.albums[!is.na(Na.albums$alb.genre2_01),-3]) #adding albums with new information to album2
Na.albums <- Na.albums[is.na(Na.albums$alb.genre2_01),] #999 albums still without genre

This implementation was a big success. We were able to make a valid guess at the genre by using the artist’s main genre, reducing the number of NAs and NULLs by more than 95%.
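The quoted figure follows directly from the counts above: 20521 albums were missing a genre, and 999 remained after the artist lookup.

```r
# Share of missing genres resolved by the artist lookup
1 - 999 / 20521  # ≈ 0.951, i.e. more than 95%
```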

To fill the final 999 genres, we use the bpm we collected earlier. But, as mentioned, we first need what we know about the genres so far to fill the gaps in our bpm data adequately.

First we take all songs whose bpm is 0; for these we fill in the mean bpm of the album the song is on, and where that information is unavailable, we fill in the mean bpm of the genre.

Note: make sure you have run section 3(b) already.

#bpm.data <- all[!duplicated(all$media_id),c("media_id","album_id")]
bpm.data <- unique(all[,c("media_id","album_id")]) #get all media
bpm.data<- merge(bpm.data,tracks,by="media_id") #join bpm
bpm.data<- merge(bpm.data,albums2[,c(-1,-4)],by="album_id", all.x=TRUE) #join genre_id we know so far

#create album and genre lookup tables for the average bpm
bpm.merge.album <- bpm.data %>% filter(track_bpm>0) %>% group_by(album_id) %>% summarise(track_bpm=mean(track_bpm))
bpm.merge.genre <- bpm.data %>% filter(track_bpm>0) %>% group_by(alb.genre2_01) %>% summarise(track_bpm=mean(track_bpm))

#fill in album_mean_bpm where it is known
bpmclean2 <- bpm.data[bpm.data$track_bpm==0,]
bpmclean2 <- merge(bpmclean2[,c(1,2,4)],bpm.merge.album, by="album_id",all.x=TRUE)

bpmclean3 <- bpmclean2[is.na(bpmclean2$track_bpm),] #where we dont have bpm data from the album
bpmclean3 <- merge(bpmclean3[,c(-4)],bpm.merge.genre, by="alb.genre2_01",all.x=TRUE) #join the information from the average_genre_lookup table

What was quite amusing is one very resilient little fellow: among all 7.5 million rows in our data, there is only one song with the genre “HörbücheraufDeutsch” (German audiobooks). Since it is the only one of its kind and forms an album by itself, we have no information to fill from, so we set its bpm to 80, which is on the lower end of the scale.

bpmclean3[is.na(bpmclean3$track_bpm),]
##            alb.genre2_01 album_id  media_id track_bpm
## 5899 HörbücheraufDeutsch 13903144 130948010        NA
bpmclean3[is.na(bpmclean3$track_bpm),4] <- 80

Now we only have to clean everything up and bind it back into a single bpm lookup table, which we save as tracks_cleaned.rda.

bpm1 <- bpm.data %>% filter(track_bpm>0)
bpm2 <- bpmclean2 %>% filter(!is.na(track_bpm))
bpm2 <- bpm2[,c(1,2,4,3)]
bpm3 <- bpmclean3[,c(2,3,4,1)]

tracks_clean <- rbind(bpm1,bpm2,bpm3)
tracks_clean <- tracks_clean[!duplicated(tracks_clean$media_id),]

tracks_clean2 <- tracks_clean[,c(2,3)]
#save(tracks_clean, file="data_for_genre_loop.rda")
#save(tracks_clean2,file="tracks_cleaned.rda")
load("/home/Deezer/30_Wrangled_Data/tracks_cleaned.rda")

myColors <- rep(brewer.pal(7,"Blues")[5:6],10)

ggplot(tracks_clean2, aes(x=track_bpm))+stat_bin(bins=20, fill=myColors, color="black")+
  scale_y_continuous(limits=c(0,80000),breaks=seq(0,80000,5000))+
  scale_x_continuous(breaks=seq(50,220,10),limits=c(50,220))+
labs(title="Distribution of Beats per Minute",
     y="observations",
      x="beats per minute",
      caption="Source: Deezer train data")+
theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("BPM_distribution.png",plot=gg8, width = 8, height=5, units="in")

To fill the missing genres for our 999 albums, we fit glm models predicting the genre from a song’s bpm, since we have already exhausted all other possibilities. A simple majority vote over how the genre is distributed among the known cases would already suggest that the missing ones are “Pop”.
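A majority-vote baseline for this kind of gap filling can be sketched in one line; the genre vector here is synthetic (on the real data the vote runs over alb.genre2_01 of the known cases, which “Pop” dominates):

```r
# Hypothetical known genre labels
genres <- c("Pop", "Pop", "Rock", "Pop", "Jazz")

# The most frequent label wins the vote
names(which.max(table(genres)))  # "Pop"
```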

load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
library(treemap)
#png(filename="genre_treemap.png",width=16, height=9, units = "in",res=72)
treemap(DeezerNew_train_0525 %>% group_by(alb.genre2_01) %>% summarise(n=log(n())), index="alb.genre2_01",vSize = "n",title = "Logarithmic Distribution of Genres", palette = "GnBu")

#dev.off()

Note: this code runs for several minutes (a glm for each of 47 genres). For the sake of creating our html report, we set these chunks to eval=FALSE to prevent them from running. You can skip the following two chunks without missing too much information.

test.loop <- tracks_clean %>% filter(is.na(alb.genre2_01)) %>% select(album_id,track_bpm)
results <- test.loop

#loop and fit
for (i in levels(tracks_clean$alb.genre2_01)){
    tracks_clean$y <- as.factor(as.numeric(tracks_clean$alb.genre2_01==i))
    train.loop <- tracks_clean %>% filter(!is.na(alb.genre2_01)) %>% select(track_bpm,y)

    fit <- glm(y~track_bpm,data=train.loop, family=binomial(link = "logit"))
    results[,i] <- predict(fit,test.loop, type="response")
}
results <- results[,-2]

At this point we have a prediction for every missing album and can show that all of them are considered to be “Pop”.

albums.finish <- melt(results,id.vars="album_id")
albums.finish <- albums.finish %>% group_by(album_id,variable) %>% summarise(value=sum(value))
albums.finish <- albums.finish %>% group_by(album_id) %>% filter(value==max(value))
table(albums.finish$variable)

The table shows that for all the songs from our 999 remaining albums, the looped glm models suggest the genre “Pop”, in line with the majority vote.

Na.albums$alb.genre2_01 <- "Pop"
albums2 <- bind_rows(albums2[,c(2,3)], Na.albums[,c(2,3)]) #adding albums with new information to album2
albums2 <- albums2[!duplicated(albums2$album_id),]

#save(albums2,file="album_genre_clean.rda")

3 (e) User clusters by genre

load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda") #we load a newer version of the data at this stage, but we only build on what has been worked out so far.
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)

One of our first thoughts about this project was: "the genre is the most important indicator of whether you like a song or not." At this point we already had trustworthy information about the genre, so we could continue with feature engineering. In a first attempt we generated a clustering based on the genre listening history. Sadly we cannot deliver the original code, because the Zeno server crashed after running it and deleted the, at that point unsaved, rmd file. Luckily, we had saved the output of the code in an .rda file.

Here is an attempt to recreate the code to give you an impression:

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,alb.genre2_01) %>% #for each user and each genre
  summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
  mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))

profs <- profs[,-3] %>%
  dcast(user_id ~ alb.genre2_01, value.var="p") #dcast is the inverse of melt, creating a new column for each value pair.
profs[is.na(profs)] <- 0
profs[1:10,c(1,3,13,17,29,41)]
##    user_id Alternative      Dance     Electro        Jazz        Rock
## 1        0 0.117741935 0.12338710 0.247741935 0.066290323 0.015483871
## 2        1 0.178102190 0.07023520 0.096025953 0.089213301 0.026277372
## 3       10 0.014770241 0.13676149 0.054704595 0.000547046 0.003829322
## 4      100 0.078269825 0.08444902 0.024716787 0.013388260 0.019567456
## 5     1000 0.201660735 0.10083037 0.186239620 0.000000000 0.059311981
## 6    10000 0.012658228 0.01265823 0.012658228 0.012658228 0.025316456
## 7    10001 0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
## 8    10002 0.097297297 0.01621622 0.021621622 0.037837838 0.075675676
## 9    10003 0.000000000 0.03623188 0.007246377 0.007246377 0.000000000
## 10   10004 0.009174312 0.00000000 0.000000000 0.000000000 0.009174312

This chunk gave us the opportunity to look at how the genres are distributed for each individual user, e.g. the user in the first row listened 11% "Alternative", 12% "Dance", 24% "Electro" and so on.

We used this information to create clusters of similar users, choosing hclust over kmeans, because we did not know in advance how many clusters we wanted to end up with and did not want to rerun the code multiple times.

The code looked somewhat like this; note that creating the distance matrix takes around 2 minutes on the Zeno server and around 5-10 minutes on a regular machine, since it consists of roughly 198 million elements.
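As a rough plausibility check (the exact number of unique users is not stated here, so ~19,900 is an assumption on our part), the size of the distance matrix follows directly from the number of user pairs:

```r
# dist() stores only the lower triangle of the distance matrix,
# i.e. n*(n-1)/2 pairwise distances for n users.
n_users <- 19900                       # hypothetical user count, not from the report
n_pairs <- n_users * (n_users - 1) / 2
n_pairs                                # ~1.98e8, in line with the ~198 million elements mentioned above
```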

d <- dist(profs[,-1])

We want to use this journey back to our beginnings to show what the different clustering methods look like. In the end, we used "complete" linkage (which means that the maximum distance between two clusters is used to link them) and set the cut-off threshold to 35 clusters.

c.tree.comp = hclust(d,method="complete")

w.tree.comp = hclust(d,method="ward.D2")

s.tree.comp = hclust(d,method="single")
par(mfrow=c(1,3)) 
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)

labs.comp = cutree(c.tree.comp,k=35)
#lookup <-as.data.frame(profs$user_id)
#lookup$profile_id <- labs.comp
#colnames(lookup) <- c("user_id","profile_id")

Adding this new feature to our previous best model increased our score by 4%.

We didn't stop at this point, since some aspects were still worrying us, e.g. that we take into account neither the number of observations nor how often a genre was played; we only look at how often a user has listened to a genre. We kept developing the idea, which is worth a chapter of its own.

3 (f) User Behavior Index

load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)

Sometimes a simple idea can lead to a cascade of thoughts and breakthroughs, as happened in our case. The thought was a simple one: "When we look at the distribution of genres for each unique user and cluster them, don't we actually ignore the preferences of the individual user?" What happens, for example, to a user who has multiple favorite genres? Imagine a user who listens to only 5 genres in an equally distributed fashion - each of those genres gets a score of 0.2 in our previous approach. Now imagine another user who listens to two genres, one 80% and one 20% of the time. At the moment we assume that all genres with a score of 0.2 are equally important, even though this is clearly not the case: for user 1, every genre is a favorite, while for user 2 the second genre is not.

How do we bypass this problem? By scaling our previous results per user. That means we take a user's favorite genre (the max() of his scores), set it to 1, and scale all other genres accordingly. With this transformation we are able to compare different users. Going back to the example above: for user 1, each of his/her 5 genres now has a score of 1, while for user 2 only the first genre is 1 and the second is scaled down respectively.
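A toy sketch (invented values, six hypothetical genre columns) of the two example users above and the per-user min-max scaling we apply below:

```r
# Toy genre distributions: rows = users, columns = genres (invented values)
profs_toy <- rbind(
  user1 = c(0.2, 0.2, 0.2, 0.2, 0.2, 0),  # five equally loved genres
  user2 = c(0.8, 0.2, 0,   0,   0,   0)   # one dominant favorite
)
# per-user min-max scaling, same transformation as for the real profiles
scaled_toy <- t(apply(profs_toy, 1, function(x) (x - min(x)) / (max(x) - min(x))))
scaled_toy["user1", 1:5]  # all five favorites become 1
scaled_toy["user2", 1:2]  # 1.00 and 0.25: the second genre is clearly not the favorite
```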

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,alb.genre2_01) %>% #for each user and each genre
  summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
  mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))

profs <- profs[,-3] %>%
  dcast(user_id ~ alb.genre2_01, value.var="p")
profs[is.na(profs)] <- 0 

scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x))))) #using min(x) instead of 0 for convenience, since for all observations min(x)==0
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything

#creating a lookuptable for final models, better not run it again, it takes ages.

#genre_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_scaled")
#save(genre_scaled.molten,file="30_05_genre_scaled_lookup.rda")
tmp <- melt(profs)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
  geom_bar(stat="identity", position="dodge")+
  scale_y_continuous(breaks=seq(0,0.65,0.05))+
  scale_fill_brewer(palette="Set1")+
  labs(x="genre",y="relative frequency", title="Relative Genre Frequency",subtitle="for three selected users")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("relative_genre.png",plot=gg1, width = 8, height=5, units="in")
tmp <- melt(scaled)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
  geom_bar(stat="identity", position="dodge")+
  scale_y_continuous(breaks=seq(0,1,0.05))+
  scale_fill_brewer(palette="Set1")+
  labs(x="genre",y="scaled relative frequency",title="Scaled Relative Genre Frequency",subtitle="for three selected users")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("scaled_relative_genre.png",plot=gg2, width = 8, height=5, units="in")

As mentioned, an idea can lead to a cascade of ideas; here is our second thought: "What if the Deezer Flow algorithm suggested a genre very often, but it wasn't listened to often, though still often enough to make it appear as a favorite genre for this user?" To get a sense for this possibility, we created a new table with, for every user, the percentage of how often a genre is listened to when it is played.

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,alb.genre2_01) %>% #for each user and genre
  summarise(c = sum(is_listened),n=n(),p=c/n) #sum(is_listened) again, but this time with n() as the group size and p as the quotient c/n

profs <- profs[,-c(3,4)] %>%
  dcast(user_id ~ alb.genre2_01, value.var="p") #long to wide format
profs[is.na(profs)] <- 0 #NAs are equal to 0
listened <- profs[,-1]
listened$user_id <- profs$user_id

#genre_listened.molten <- melt(listened, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_listened")
#save(genre_listened.molten,file="30_05_genre_listened_lookup.rda")
tmp <- melt(listened)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
  geom_bar(stat="identity", position="dodge")+
  scale_y_continuous(breaks=seq(0,1,0.05),limits = c(0,1))+
  scale_fill_brewer(palette="Set1")+
  labs(x="genre",y="average is_listened",title="Average is_listened per Genre",subtitle="for three selected users")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("average_is_listened.png",plot=gg3, width = 8, height=5, units="in")

Combining the scaled genre preferences with the listening history is exactly what we had in mind. As one of the values decreases, the overall trustworthiness of the data goes down. As long as both values are high, or both extremely low, we can be pretty sure that the user will listen, or not listen, to the song respectively.

One aspect we haven't considered yet is the sample size. Until this point we cannot differentiate between a user who has, for example, listened to only one song of a genre (scaled value and listening behavior are both 1) and someone who has listened to a genre hundreds of times.

To compensate for this in some way, we extracted this information as well:

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,alb.genre2_01) %>% #for each user and genre
  summarise(n=n()) %>% #count
  dcast(user_id ~ alb.genre2_01, value.var="n") #long to wide

profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id

#genre_counts.molten <- melt(counts, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_counts")
#save(genre_counts.molten,file="30_05_genre_counts_lookup.rda")
tmp <- melt(counts)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
  geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,90,5),limits = c(0,90))+
  scale_fill_brewer(palette="Set1")+
  labs(x="genre",y="Count",title="Count per Genre",subtitle="for three selected users")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("count.png",plot=gg4, width = 8, height=5, units="in")

Test whether columns and user_ids are arranged identically in each of the three tables:

sum(names(listened) != names(scaled)) #should be 0
## [1] 0
sum(names(listened) != names(counts))
## [1] 0
sum(scaled$user_id!=listened$user_id)
## [1] 0
sum(scaled$user_id!=counts$user_id)
## [1] 0

In our last step we combined the three values. To convey our thought process throughout the report, we will show how the combination changed over several rounds of testing and iterative feedback.

To demonstrate what drove the changes, we create a small set of example data, where:

g <- c(1,.8,.6,.3) #Genre-Distribution
l <- c(.25,.7,.8,.95) #Listening-History
c <- c(200,68,45,19) #Count

Our first approach was g*l*sqrt(c), using c as a weight to get higher index values when more data is available. We never had the intention of keeping the value inside a 0:1 boundary.

g*l*sqrt(c)
## [1] 3.535534 4.617878 3.219938 1.242286

As we can see, there is not enough emphasis on the last case: the last genre has a 95% listening rate over 19 observations, so if this genre were played again, we could predict fairly confidently that the user is going to listen to it.

Additionally, case 1 still has the second highest value, which would indicate a high likelihood of "is_listened" for a future song of this genre. But its listening value is only 0.25, which indicates that it is rather unlikely that the user is going to listen to this genre next time.

In a first step, we exchanged the sqrt() for a log(x+1) function, which assigns relatively less weight as the count grows; the +1 inside the log is needed because log(0) would otherwise lead to -Inf.

par(mfrow=c(1,2))
plot(log(1:1000))
plot(sqrt(1:1000))

g*l*log(c+1)
## [1] 1.3258262 2.3710996 1.8377479 0.8537837

We can see that the values are closer together now; still, the log is a strong enough penalty for low-count cases. As intended, case 1 has a lower value than before, since we no longer give it as much weight through its high n.

Since we wanted more emphasis on the listening history, we thought about assigning a weight manually, something like:

weight <- 5
g*(l*weight)*log(c+1)
## [1]  6.629131 11.855498  9.188739  4.268918

But tweaking an imaginary value isn't what we wanted to do. A more solid way of putting more emphasis on the listening behavior was to shrink the influence of g by using sqrt(g). The square root performs well because g ranges between 0 and 1, so sqrt(max(g))==1 and sqrt(min(g))==0, while every value of g below 1 is pulled closer to 1, which compresses the differences in g and in turn puts more weight on l. Additionally, the effect is strongest on low values, which is a beneficial side effect.

plot(sqrt(seq(1,0,length.out = 100)))

sqrt(g)*l*log(c+1)
## [1] 1.325826 2.650970 2.372522 1.558789

In this final version we can see that case 4 is now more important than case 1, but, due to the small sample size, not excessively so.

Here is the final code, which computes the behavior_index:

behavior <- sqrt(scaled[,-47])*listened[,-47]*log(counts[,-47]+1)
behavior$user_id <- scaled$user_id
#behavior_molten <- melt(behavior, id.vars="user_id",variable.name="alb.genre2_01",value.name="behavior")
#save(behavior_molten,file="25_05_new_behavior_lookup.rda")
tmp <- melt(behavior)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
  geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,4,0.25),limits = c(0,4))+
  scale_fill_brewer(palette="Set1")+
  labs(x="genre",y="behavior index",title="Behavior Index",subtitle="for three selected users")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("behavior_index.png",plot=gg5, width = 8, height=5, units="in")
tmp <- cbind.data.frame(sqrt(seq(0,1,length.out = 100)),seq(0,1,length.out = 100))
colnames(tmp) <- c("after","before")
ggplot(tmp,aes(x=before,y=after))+geom_point()+
  scale_y_continuous(breaks=seq(0,1,0.1),limits = c(0,1))+
  scale_x_continuous(breaks=seq(0,1,0.1),limits = c(0,1))+
  scale_fill_brewer(palette="Set1")+
  labs(x="x",y="sqrt(x)",title="Sqrt(x) versus x")+
  theme_light()+
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("sqrt.png",plot=gg6, width = 8, height=5, units="in")

As you can imagine, the behavior matrix is sparsely filled, since unlistened genres are set to 0 (because g will be 0, even if c is not). A higher Behavior_index value for a genre indicates a higher likelihood of it being listened to again by the user, while values close to 0 indicate a low likelihood, respectively. We used the data for clustering as well as a feature in itself. To use it as a feature we melt() the matrix and saved it as a lookup table, which we later left join onto the data using user_id and genre_id as a key.
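A minimal, self-contained sketch of this melt-and-left-join step (toy data, invented values; the real lookup tables are built from the full behavior matrix):

```r
library(reshape2)
library(dplyr)

# Toy wide behavior matrix: one row per user, one column per genre
behavior_toy <- data.frame(user_id = c(1, 2), Pop = c(2.1, 0), Rock = c(0.4, 1.3))

# melt() into a long lookup table: one row per (user_id, genre) pair
lookup <- melt(behavior_toy, id.vars = "user_id",
               variable.name = "alb.genre2_01", value.name = "behavior")
lookup$alb.genre2_01 <- as.character(lookup$alb.genre2_01)

# left join onto (toy) observations, using user_id plus genre as the key
obs <- data.frame(user_id = c(1, 2), alb.genre2_01 = c("Rock", "Pop"))
obs <- left_join(obs, lookup, by = c("user_id", "alb.genre2_01"))
obs$behavior  # 0.4 for user 1 / Rock, 0 for user 2 / Pop
```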

formattable(behavior[1:10,1:10])
AfrikanischeMusik Alternative AsiatischeMusik      Blues Bolero BrasilianischeMusik Comedy ContemporaryR&B ContemporarySoul Corridos
        0.1919238  4.42734929      0.00000000 0.01754381      0           0.1206416      0      0.08147095       0.41231820        0
        0.0000000  4.82799863      0.01150143 0.00000000      0           0.3301773      0      0.06739716       0.05027516        0
        0.2037738  0.26026544      0.00000000 0.00000000      0           0.1207933      0      0.00000000       0.00000000        0
        0.0000000  0.78374923      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        0.3581292  2.67147990      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        0.3856975  0.05425414      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        0.0000000  0.00000000      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        0.0000000  1.21911356      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        2.1302331  0.00000000      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0
        0.0000000  0.03583519      0.00000000 0.00000000      0           0.0000000      0      0.00000000       0.00000000        0

3 (g) User Behavior Index for Clustering users

Finally, we can replace the previous clustering by clustering users by their Behavior_index. If you haven't read 3 (e), you should go back and read the explanations of our clustering approaches and code.

We want to use this journey back to our beginnings to show what the different clustering methods look like. In the end, we used the "complete" method and set the cutoff point to 35 clusters.

d <- dist(behavior[,-47]) 
c.tree.comp = hclust(d,method="complete")

w.tree.comp = hclust(d,method="ward.D2")

s.tree.comp = hclust(d,method="single")
#png(filename="clustering_types.png",width=16, height=9, units = "in",res=72)
par(mfrow=c(1,3)) 
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)

#dev.off()

#png(filename="clustering_complete.png",width=16, height=9, units = "in",res=72)
plot(c.tree.comp)
#dev.off()

tree.comp = hclust(d,method="complete")
labs.comp = cutree(tree.comp,k=35)
cluster_user_lookup <-as.data.frame(behavior$user_id)
cluster_user_lookup$profile_id <- labs.comp
colnames(cluster_user_lookup) <- c("user_id","profile_id")
# save(cluster_user_lookup, file="23_05_SQRT_35_Clusters.rda")
load("/home/Deezer/30_Wrangled_Data/23_05_SQRT_35_Clusters.rda")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],17),"#4292C6")

ggplot(cluster_user_lookup, aes(x=profile_id))+stat_bin(binwidth=1,fill=myColors, color="black")+scale_y_log10()+
  scale_x_continuous(breaks=seq(1,35,1))+
labs(y="log10 of observations",
      x="profile cluster",
      title="Logarithmic Distribution of Profile Belonging",
      caption="Source: Feature Engineering Behavior_index Clustering")+
theme_light()+theme(axis.text.x = element_text(angle = 75, hjust = 1))

#ggsave("distribution_profiles.png",plot=gg13, width = 8, height=5, units="in")

The profiles were saved in a lookup table and joined onto the data. Overall, adding the behavior_index as well as the behavior_profile increased our accuracy by a significant amount. As a bonus, we could use the data for g, l and c as additional features.

3 (h) Using the Behavior Architecture to Create Further Features

A day before the deadline, after the meeting with Prof. Löcher and following his suggestion, we used the existing behavior_index architecture to create new features, not just for genre, but also for artist and album. In this report we set the following code to eval=FALSE, because it is very time-consuming. For example: the lookup table for the artist/user behavior has roughly 1.3e+09 rows, the lookup for albums 2.97e+09 rows.
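A quick sanity check of the quoted artist lookup size (the user count below is our assumption, not stated in the report; the artist count follows from the column index 67143 used in the code further down, minus the user_id column):

```r
n_users   <- 19900    # hypothetical number of unique users
n_artists <- 67142    # 67143 columns minus the user_id column
n_users * n_artists   # ~1.34e9 rows for the melted artist lookup, matching "roughly 1.3e+09"
```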

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,artist_id) %>% 
  summarise(c = sum(is_listened)) %>% 
  mutate(p = c/sum(c))
profs <- profs[,-3] %>%
  dcast(user_id ~ artist_id, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x)))))
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything

artist_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="artist_id",value.name="artist_scaled")
#save(artist_scaled.molten,file="30_05_artist_scaled_lookup.rda")

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,artist_id) %>% 
  summarise(c = sum(is_listened),n=n(),p=c/n)
profs <- profs[,-c(3,4)] %>%
  dcast(user_id ~ artist_id, value.var="p")
profs[is.na(profs)] <- 0
listened <- profs[,-1]
listened$user_id <- profs$user_id

artist_listened.molten <- melt(listened, id.vars="user_id",variable.name="artist_id",value.name="artist_listened")
#save(artist_listened.molten,file="30_05_artist_listened_lookup.rda")

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,artist_id) %>% 
  summarise(n=n()) %>%
  dcast(user_id ~ artist_id, value.var="n")
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id

artist_counts.molten <- melt(counts, id.vars="user_id",variable.name="artist_id",value.name="artist_counts")
#save(artist_counts.molten,file="30_05_artist_counts_lookup.rda")

behavior_artist <- sqrt(scaled[,-67143])*listened[,-67143]*log(counts[,-67143]+1)
behavior_artist$user_id <- scaled$user_id
behavior_artist_molten <- melt(behavior_artist, id.vars="user_id",variable.name="artist_id",value.name="behavior_artist")
#save(behavior_artist_molten,file="30_05_behavior_artist_lookup.rda")

Here is the same code for album_id:

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,album_id) %>% 
  summarise(c = sum(is_listened)) %>% 
  mutate(p = c/sum(c))
profs <- profs[,-3] %>%
  dcast(user_id ~ album_id, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x)))))
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything

album_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="album_id",value.name="album_scaled")
#save(album_scaled.molten,file="30_05_album_scaled_lookup.rda")

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,album_id) %>% 
  summarise(c = sum(is_listened),n=n(),p=c/n)
profs <- profs[,-c(3,4)] %>%
  dcast(user_id ~ album_id, value.var="p")
profs[is.na(profs)] <- 0
listened <- profs[,-1]
listened$user_id <- profs$user_id

album_listened.molten <- melt(listened, id.vars="user_id",variable.name="album_id",value.name="album_listened")
#save(album_listened.molten,file="30_05_album_listened_lookup.rda")

profs <- DeezerNew_train_0525 %>% 
  group_by(user_id,album_id) %>% 
  summarise(n=n()) %>%
  dcast(user_id ~ album_id, value.var="n")
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id

album_counts.molten <- melt(counts, id.vars="user_id",variable.name="album_id",value.name="album_counts")
#save(album_counts.molten,file="30_05_album_counts_lookup.rda")

behavior_album <- sqrt(scaled[,-ncol(scaled)])*listened[,-ncol(listened)]*log(counts[,-ncol(counts)]+1)
behavior_album$user_id <- scaled$user_id
behavior_album_molten <- melt(behavior_album, id.vars="user_id",variable.name="album_id",value.name="behavior_album")
#save(behavior_album_molten,file="30_05_behavior_album_lookup.rda")

Sadly, the computation wasn't fast enough for us to get a final prediction with the new features. But we ran some hyper-tuned models after the competition and will present the results later on.

knitr::knit_exit()
rm(list = ls())